t--- title: "HW06" output: github_document ---

In cancer, mutations arise that promote growth or survival of cells. In glioblastoma multiforme and other high grade gliomas, a common mutation is a mutation of the 27th lysine (K) to a methionine (M) of the histone subunit H3, or in short H3K27M.

H3K27M is the most frequent oncohistone in brain cancers, but the biology is still not well understood. Your analysis is to look at the expression of several (27) genes to see if they are differentially expressed and plot 27 boxplots each gene. The data used in this analysis was obtained from this publication

Steps:

  1. Read in the 45 processed RNA-Seq data found in "./RNA_Seq_processed"
  2. Map gene.ids to gene.symbols (which I have for you)
  3. For 27 genes of interest AND your favorite gene, perform a t-test to see if it is differentially expressed between the WT vs H3K27M samples
  4. Create a graphing function and then create a boxplot that graphs expression between the two groups

Code

From the RNA-Seq files, you only need the "Name" from one file and the "TPM" column from all the files. TPM stands for "transcripts per million" and is a common unit for normalized expression data.

library(tidyverse)
library(readr)
library(stringr)
library(knitr)

#hint, using apply (specifically sapply) you can read in the data into a list and then bind the columns together. Or you can use a for loop too. 

#you only need the 

# Create a function to read the specific columns
read_col <- function(data){
  file <- read_tsv(data)
  return(file[["TPM"]])
}

# Get all the file directories
RNA_files <- c(str_c("RNA_Seq_processed/H3K27M/",list.files("RNA_Seq_processed/H3K27M")), str_c("RNA_Seq_processed/WT/",list.files("RNA_Seq_processed/WT")))

# Read and combine the "TPM" column of all the files
GBM.transcripts <- data.frame(sapply(RNA_files,read_col))

# Get the "name" column
file1 <- read_tsv(RNA_files[[1]])

# Combine the "name" column and the "TPM" columns
GBM.transcripts$gene_id <- file1$Name

Now, install the packages commented below (if needed), and then use this code to map the transcript IDs to gene symbols. To use this code, you need a dataframe called GBM.transcripts that has the first column "gene_id" that contains the transcript ids (e.g. ENST00000456328.2) and the remaining columns contain the TPM data. So long as the first column contains the "gene_id" column as mentioned above, this should run.

#install.packages("BiocManager")
#BiocManager::install("ensembldb")
#BiocManager::install("EnsDb.Hsapiens.v75")
library(ensembldb)
library(EnsDb.Hsapiens.v75)

ens.GBM.transcripts <- GBM.transcripts %>% 
  mutate(gene_id = gsub(pattern = "\\..*", "", .$gene_id))

map <- ensembldb::select(EnsDb.Hsapiens.v75, keys = ens.GBM.transcripts$gene_id,
                         keytype = "TXID", columns = c("SYMBOL", "TXID"))

ens.mapped_GBM <- left_join(ens.GBM.transcripts, map, by = c("gene_id" = "TXID")) %>% 
  dplyr::select(-1) %>% 
  dplyr::select(gene_symbol = SYMBOL, everything())

ens.mapped_GBM <- ens.mapped_GBM[!duplicated(ens.mapped_GBM$gene_symbol),] #remove duplicated gene symbols
  #these are removed instead of averaged because they simply do not correlate particularly well. 
ens.mapped_GBM <- ens.mapped_GBM[!is.na(ens.mapped_GBM$gene_symbol),] #remove NA values

Do the t-test and make a table of the t-test results!

#run this code to unload the libraries from before, it might be helpful because the select() function from dplyr might be hidden otherwise
detach(package:EnsDb.Hsapiens.v75, unload = T)
detach(package:ensembldb, unload = T)

#add in your own gene of interest!!! 
genes_of_interest <- c("OR4G4P", "IRX1", "OSR1", "DCHS2", "BRINP3", "TOB2P1", "FOXD1", "ZFPM2", "GLB1", "ALG5", "TRIM4", "ADARB2", "PCDHGA11", "IDH1", "EGFR", "MGMT", "TERT", "PTEN", "TP53", "RB1", "ATRX", "PDGFRA", "PIK3CA", "MICA", "CDKN2A", "EZH2", "BRD2")

# I found there are 2 "PTEN", so I deleted one. Hope that is okay.

GBM.genes.of.interest <- filter(ens.mapped_GBM, gene_symbol %in% genes_of_interest)

#Now perform a t-test between the H3K mutated and the wt samples. There are many ways to do this actually, you can use a for loop or you could do the tidy alternative with broom(), but the for loop is probably the easiest

H3K27M <- GBM.genes.of.interest %>% 
  dplyr::select(contains(".H3K27M."), gene_symbol) %>% 
  gather(key = "Sample", value = "Expression27", -gene_symbol) %>% 
  mutate(type = "H3K27M")
  
WT <- GBM.genes.of.interest %>% 
    dplyr::select(contains(".WT."), gene_symbol) %>% 
  gather(key = "Sample", value = "Expression27", -gene_symbol) %>% 
  mutate(type = "WT")

GBM.genes.of.interest2 <- rbind(H3K27M, WT)

t_results <- list()
for (i in genes_of_interest){
  gene_temp <- GBM.genes.of.interest2[GBM.genes.of.interest2$gene_symbol == i,]
  t_results[[i]] <- t.test(gene_temp$Expression27[gene_temp$type == "H3K27M"],gene_temp$Expression27[gene_temp$type == "WT"])
}


# Make a table to store the results
t_results_sum <- data.frame(genes_of_interest,sapply(t_results, function(l) l[["statistic"]]), sapply(t_results, function(l) l[["p.value"]]))

colnames(t_results_sum)[2:3] <- c("t_statistic", "p_value")
rownames(t_results_sum) <- NULL

#print out the t-test results
kable(t_results_sum)
genes_of_interest t_statistic p_value
OR4G4P -0.7522158 0.4589634
IRX1 5.1336629 0.0000099
OSR1 5.6829774 0.0000034
DCHS2 5.7324096 0.0000046
BRINP3 5.1194522 0.0000159
TOB2P1 -4.3860210 0.0001389
FOXD1 -4.4889072 0.0001679
ZFPM2 4.5714175 0.0001077
GLB1 -3.9100261 0.0004298
ALG5 -4.4610902 0.0000905
TRIM4 -3.7724880 0.0005934
ADARB2 6.2938237 0.0000010
PCDHGA11 -1.1719705 0.2541779
IDH1 -1.5486643 0.1313748
EGFR -1.4214767 0.1691524
MGMT 0.9184095 0.3636800
TERT -1.2419552 0.2262129
PTEN -2.2353246 0.0315199
TP53 -0.0400511 0.9682425
RB1 -1.6058411 0.1208662
ATRX -0.9777387 0.3343672
PDGFRA -0.2277656 0.8215680
PIK3CA -0.2722628 0.7867548
MICA -2.3374256 0.0293613
CDKN2A -2.1599041 0.0423925
EZH2 0.0030369 0.9975928
BRD2 1.4820614 0.1472154

Now create a graphing function to create boxplots to visualize the results. Plot expression on the y-axis. The graph should look like this example

#to work in the tidyverse, it will be easier to make tidy the dataframe first

GBM.genes.of.interests_split <- GBM.genes.of.interest2 %>% 
  group_split(gene_symbol)

#create a graphing function
my_plot <- function(data){
  p_temp <- ggplot(data, aes(x = type, y = Expression27, fill = type)) +
  geom_boxplot() +
  labs(title = paste0(data$gene_symbol[1]," Expression in GBM models \n by H3K27 Mutated or WT Status"), x = "H3K27", y = "Expression_(TPM)") +
  theme(plot.title = element_text(hjust = 0.5),legend.position = "none") +
  scale_x_discrete(labels = c('H3K27_Mutated','WT'))
  
  print(p_temp)
}

#then use a for loop combined with the graphing function to make a graph for all your genes of interest 
lapply(GBM.genes.of.interests_split, my_plot)

## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

## 
## [[5]]

## 
## [[6]]

## 
## [[7]]

## 
## [[8]]

## 
## [[9]]

## 
## [[10]]

## 
## [[11]]

## 
## [[12]]

## 
## [[13]]

## 
## [[14]]

## 
## [[15]]

## 
## [[16]]

## 
## [[17]]

## 
## [[18]]

## 
## [[19]]

## 
## [[20]]

## 
## [[21]]

## 
## [[22]]

## 
## [[23]]

## 
## [[24]]

## 
## [[25]]

## 
## [[26]]

## 
## [[27]]

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] AnnotationFilter_1.12.0 GenomicFeatures_1.40.1  AnnotationDbi_1.50.3   
##  [4] Biobase_2.48.0          GenomicRanges_1.40.0    GenomeInfoDb_1.24.2    
##  [7] IRanges_2.22.2          S4Vectors_0.26.1        BiocGenerics_0.34.0    
## [10] knitr_1.29              forcats_0.5.0           stringr_1.4.0          
## [13] dplyr_1.0.1             purrr_0.3.4             readr_1.3.1            
## [16] tidyr_1.1.1             tibble_3.0.3            ggplot2_3.3.2          
## [19] tidyverse_1.3.0        
## 
## loaded via a namespace (and not attached):
##  [1] ProtGenerics_1.20.0         matrixStats_0.56.0         
##  [3] bitops_1.0-6                fs_1.5.0                   
##  [5] lubridate_1.7.9             bit64_4.0.2                
##  [7] progress_1.2.2              httr_1.4.2                 
##  [9] tools_4.0.2                 backports_1.1.8            
## [11] R6_2.4.1                    lazyeval_0.2.2             
## [13] DBI_1.1.0                   colorspace_1.4-1           
## [15] withr_2.2.0                 tidyselect_1.1.0           
## [17] prettyunits_1.1.1           bit_4.0.4                  
## [19] curl_4.3                    compiler_4.0.2             
## [21] cli_2.0.2                   rvest_0.3.6                
## [23] xml2_1.3.2                  DelayedArray_0.14.1        
## [25] labeling_0.3                rtracklayer_1.48.0         
## [27] scales_1.1.1                askpass_1.1                
## [29] rappdirs_0.3.1              digest_0.6.25              
## [31] Rsamtools_2.4.0             rmarkdown_2.3              
## [33] XVector_0.28.0              pkgconfig_2.0.3            
## [35] htmltools_0.5.0             highr_0.8                  
## [37] dbplyr_1.4.4                rlang_0.4.7                
## [39] readxl_1.3.1                rstudioapi_0.11            
## [41] RSQLite_2.2.0               farver_2.0.3               
## [43] generics_0.0.2              jsonlite_1.7.0             
## [45] BiocParallel_1.22.0         RCurl_1.98-1.2             
## [47] magrittr_1.5                GenomeInfoDbData_1.2.3     
## [49] Matrix_1.2-18               Rcpp_1.0.5                 
## [51] munsell_0.5.0               fansi_0.4.1                
## [53] lifecycle_0.2.0             stringi_1.4.6              
## [55] SummarizedExperiment_1.18.2 zlibbioc_1.34.0            
## [57] BiocFileCache_1.12.1        grid_4.0.2                 
## [59] blob_1.2.1                  crayon_1.3.4               
## [61] lattice_0.20-41             Biostrings_2.56.0          
## [63] haven_2.3.1                 hms_0.5.3                  
## [65] pillar_1.4.6                biomaRt_2.44.1             
## [67] reprex_0.3.0                XML_3.99-0.5               
## [69] glue_1.4.1                  evaluate_0.14              
## [71] modelr_0.1.8                vctrs_0.3.2                
## [73] cellranger_1.1.0            gtable_0.3.0               
## [75] openssl_1.4.2               assertthat_0.2.1           
## [77] xfun_0.16                   broom_0.7.0                
## [79] GenomicAlignments_1.24.0    memoise_1.1.0              
## [81] ellipsis_0.3.1